1  Introduction

Analysts  face  a  daunting  task:  they  must  accurately 
analyze,  categorize,  and  assimilate  a  large  body  of 
information  from  a  variety  of  sources  and  for  a  va¬ 
riety  of  domains  of  interest.  The  complexity  of  the 
task  necessitates  a  variety  of  information  access  and 
extraction  tools  which  technology  up  to  this  point 
has  not  been  able  to  provide.  SRI’s  TIPSTER  Phase 
III  project  has  focused  on  two  major  obstacles  to  the 
development of such tools: inadequate degrees of accuracy
and portability. We begin by providing an
overview  of  SRI’s  information  extraction  (IE)  sys¬ 
tem,  Fastus,  and  then  describe  our  efforts  in  these 
two  areas  in  turn.  We  then  conclude  with  some 
thoughts  concerning  future  directions. 

2  Overview  of  FASTUS 

Fastus  processes  natural  language  and  produces 
representations  of  the  information  relevant  to  a  par¬ 
ticular  application,  typically  in  the  form  of  database 
templates.  As  an  example,  we  consider  the  task 
specified  for  the  Sixth  Message  Understanding  Con¬ 
ference  (MUC-6),  which  was,  roughly  speaking,  to 
identify  information  in  business  news  that  describes 
executives  moving  in  and  out  of  high-level  positions 
within companies (Appelt et al., 1995). When Fastus
encounters a passage such as example (1),

(1)  John  Smith,  47,  was  named  president  of  ABC 
Corp.  He  replaces  Mike  Jones. 

it  should  extract  the  information  that  Mike  Jones  is 
‘out’  and  John  Smith  is  ‘in’  at  the  position  of  presi¬ 
dent  of  company  ABC  Corp. 

Fastus  consists  of  three  major  components.  The 
first  is  the  pattern  recognition  module,  which  consists 
of  a  series  of  finite  state  transducers  that  recognize 
patterns in the text and create templates representing
event and entity descriptions. Pattern recognition
relies on a second component, the coreference
module, which identifies the referents of a variety
of  types  of  referential  expressions  (e.g.,  pronouns, 
definite  noun  phrases).  Finally,  the  merger  unifies 
templates  created  from  different  phrases  in  the  text 
that  describe  the  same  events. 

We  illustrate  by  walking  through  an  analysis  of 
passage  (1).  The  input  is  initially  processed  by  us¬ 
ing  the  finite  state  transducers  to  recognize  relevant 
patterns  and  annotate  the  text  accordingly.  First, 
one  or  more  preprocessing  phases  recognize  low-level 
patterns  such  as  person  names,  organization  names, 
and  parts  of  speech. 

[John Smith]PERS-NAME [47]NUM [was]AUX
[named]V [president]N [of]P [ABC Corp]ORG-NAME

The  parsing  phase  identifies  very  local  syntactic  con¬ 
stituents,  such  as  noun  groups  and  verb  groups;  no 
attachment  of  ambiguous  modifiers  is  attempted. 

[John Smith]PERS-NAME [47]NUM [was named]VG
[president]NG [of]P [ABC Corp]ORG-NAME

The  combiner  phase  pieces  together  slightly  larger 
constituents  when  it  can  be  done  reliably. 

[John Smith, 47]PERS-NG [was named]VG
[president of ABC Corp.]POS-NG

Finally,  the  domain  phase  applies  domain-dependent 
patterns  to  the  sentence  to  identify  clause-level 
states  and  events.  In  this  case,  the  entire  sentence 
will  match  such  a  pattern. 

[John Smith, 47, was named president of ABC
Corp.]DOMAIN-EVENT
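The cascade walked through above can be approximated in a few lines. The following is only a minimal sketch: it replaces the finite state transducers with a table of rewrite rules over tag sequences, and the lexicon, the rule tables, the tag names, and the functions `apply_phase` and `run_cascade` are illustrative assumptions rather than part of Fastus.

```python
# Toy lexicon standing in for the preprocessing transducers (all tags are assumptions).
LEXICON = {
    "John Smith": "PERS-NAME", "47": "NUM", "was": "AUX", "named": "V",
    "president": "N", "of": "P", "ABC Corp.": "ORG-NAME",
}

# Each phase is an ordered list of (tag sequence, new tag) rewrite rules.
PARSER = [(("AUX", "V"), "VG"), (("N",), "NG")]
COMBINER = [(("PERS-NAME", "NUM"), "PERS-NG"), (("NG", "P", "ORG-NAME"), "POS-NG")]
DOMAIN = [(("PERS-NG", "VG", "POS-NG"), "DOMAIN-EVENT")]


def apply_phase(items, rules):
    """Scan left to right, applying the first rule that matches at each position."""
    out, i = [], 0
    while i < len(items):
        for tags, label in rules:
            span = items[i:i + len(tags)]
            if tuple(tag for _, tag in span) == tags:
                out.append((" ".join(text for text, _ in span), label))
                i += len(tags)
                break
        else:  # no rule matched here: pass the item through unchanged
            out.append(items[i])
            i += 1
    return out


def run_cascade(tokens):
    """Run the preprocessing, parsing, combining, and domain phases in order."""
    items = [(tok, LEXICON[tok]) for tok in tokens]
    for rules in (PARSER, COMBINER, DOMAIN):
        items = apply_phase(items, rules)
    return items


result = run_cascade(["John Smith", "47", "was", "named", "president", "of", "ABC Corp."])
print(result)
```

Each phase passes exactly one analysis to the next, which mirrors the best-analysis-only behavior of the original pattern recognition component.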

Recognizing  a  pattern  in  the  domain  phase  typi¬ 
cally  causes  one  or  more  template  objects  to  be  cre¬ 
ated.  In  light  of  the  MUC-6  task  specification,  we 
defined  transition  templates  that  track  movements
in  and  out  of  positions  at  companies;  a  person’s 
leaving  a  job  is  represented  by  the  start  state  of  a 
transition,  whereas  a  person’s  taking  a  job  is  rep¬ 
resented  by  the  end  state.  Therefore,  the  person, 
company,  and  position  in  the  first  sentence  of  (1) 
are  represented  in  an  end  state  since  Smith  is  taking 
the  described  position.  To  facilitate  certain  types  of 
inferencing  during  the  merging  phase,  we  also  posit 
that  someone,  at  this  time  unknown,  is  most  likely 
leaving  the  position,  this  being  represented  in  the 
transition’s  start  state  as  shown  in  Figure  1. 


START

Person        -
Position      president
Organization  ABC Corp.

END

Person        John Smith
Position      president
Organization  ABC Corp.

Figure 1: Template Generated from “John Smith was
named president of ABC Corp.”

The second sentence in the passage, “He replaces
Mike Jones.”, is then analyzed by the pattern matching
phases, the details of which we omit. During
this analysis, the coreference module identifies John
Smith as the referent of “he”. Once a domain-level
pattern has been recognized, all that is known is that
there is a start state involving the person Mike Jones
and an end state involving the person John Smith,
represented by the template shown in Figure 2.

START

Person        Mike Jones
Position      -
Organization  -

END

Person        John Smith
Position      -
Organization  -

Figure 2: Template Generated from “He replaces
Mike Jones.”

As they stand, of course, these two templates do
not appropriately summarize the information in the
text; there is a discourse-level relationship between
the two that must be captured. This is the job of the
merging component. When a new template is created,
the merger attempts to unify it with templates
that precede it. In this case, the template shown in
Figure 2 should be unified with the template shown
in Figure 1 to produce the template shown in Figure 3,
which will lead to the correct output.

START

Person        Mike Jones
Position      president
Organization  ABC Corp.

END

Person        John Smith
Position      president
Organization  ABC Corp.

Figure 3: A Successful Merge


3  Focus  on  Accuracy 

The  first  major  obstacle  to  the  broad  deployment 
of  IE  technology  we  address  is  the  inadequate  level 
of  accuracy  of  existing  systems.  We  have  sought  to 
push  the  accuracy  of  each  of  the  three  major  modules 
of  Fastus  in  our  TIPSTER  effort. 

3.1  A  Lattice-Based  System  for  Pattern 
Recognition 

One  of  the  main  reasons  for  the  success  of  Fastus  is 
that  it  bypasses  much  of  the  complex  linguistic  pro¬ 
cessing  characteristic  of  previous  systems.  Process¬ 
ing  decisions  are  made  using  local  rather  than  global 
evidence,  minimizing  the  risk  that  correct  analyses 
get  lost  in  a  sea  of  incorrect  ones.  For  instance,  at 
each  phase  in  the  pattern  recognition  component, 
only  the  analysis  deemed  to  be  the  best  is  passed  to 
the  next  phase.  Unfortunately,  while  this  strategy 
has  proved  advantageous  in  general,  in  many  cases 
it  leads  to  premature  processing  decisions  based  on 
too  little  information. 

For  instance,  for  the  following  example, 

(2)  The  committee  heads  announced  the  appoint¬ 
ment  of  John  Smith  as  CEO. 

the  parser  phase  of  Fastus  will  correctly  mark  “the 
committee  heads”  as  a  noun  group.  This  decision  is 
made  because  the  noun  usage  of  “head”  is  more  com¬ 
mon  than  the  verb  usage  in  this  domain,  and  because 
of  a  “greedy”  preference  for  longer  constituents.  Us¬ 
ing  the  same  heuristics  for  example  (3), 

(3)  The  committee  heads  Viacom’s  CEO  recruit¬ 
ment  efforts. 

the system will generate the same analysis for “the
committee heads”. Since “heads” is actually used as
a  main  verb  in  this  example  (with  “the  committee” 
as  its  subject),  the  parser’s  incorrect  choice  will  re¬ 
sult  in  there  being  no  domain-phase  analysis  for  the 
sentence. 
The  trick,  then,  is  to  try  to  improve  the  scope  (and
thus,  the  accuracy)  of  the  current  mechanisms,  with¬ 
out  adopting  the  inadequacies  of  previous  frame¬ 
works  that  Fastus  was  designed  to  improve  upon. 
To  do  this,  we  implemented  a  lattice-based  version  of 
Fastus.  In  keeping  with  the  finite  state  paradigm, 
the  pattern  recognition  phases  perform  transduc¬ 
tions  over  compact  lattice  representations  of  the  in¬ 
put,  passing  such  representations  between  phases. 
With  a  lattice  representation,  there  are  as  many 
analyses  for  a  string  as  there  are  paths  through  it, 
yet  processing  remains  efficient.  The  processing  of 
examples  (2)  and  (3)  will  result  in  a  lattice  with 
both  possible  analyses  for  “the  committee  heads”. 
In  example  (2),  the  successful  match  will  result  from 
matching  a  path  in  which  “the  committee  heads”  is 
analyzed  as  a  noun  group,  whereas  in  (3),  the  suc¬ 
cessful  domain-phase  match  will  result  from  a  path 
in  which  “the  committee”  is  analyzed  as  a  noun 
group  and  “heads”  is  analyzed  as  a  verb  group. 
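To make the lattice idea concrete, the sketch below represents each example as a small lattice whose arcs carry a constituent label, and enumerates paths until one matches a domain pattern. The arc encoding, the node numbering, the single pattern `("NG", "VG", "NG")`, and the function names are assumptions made for illustration; the actual system performs transductions over the lattice rather than enumerating paths.

```python
def paths(lattice, node, goal):
    """Yield every sequence of (label, text) arcs leading from node to goal."""
    if node == goal:
        yield []
        return
    for label, text, nxt in lattice.get(node, []):
        for rest in paths(lattice, nxt, goal):
            yield [(label, text)] + rest


def domain_matches(lattice, goal, pattern):
    """Return the paths whose label sequence equals the domain pattern."""
    return [p for p in paths(lattice, 0, goal)
            if tuple(label for label, _ in p) == pattern]


# Both lattices keep the two competing analyses of "the committee heads".
EXAMPLE_2 = {
    0: [("NG", "the committee heads", 3), ("NG", "the committee", 2)],
    2: [("VG", "heads", 3)],
    3: [("VG", "announced", 4)],
    4: [("NG", "the appointment of John Smith as CEO", 5)],
}
EXAMPLE_3 = {
    0: [("NG", "the committee heads", 3), ("NG", "the committee", 2)],
    2: [("VG", "heads", 3)],
    3: [("NG", "Viacom's CEO recruitment efforts", 4)],
}
PATTERN = ("NG", "VG", "NG")  # hypothetical clause-level pattern

print(domain_matches(EXAMPLE_2, 5, PATTERN))
print(domain_matches(EXAMPLE_3, 4, PATTERN))
```

In this toy setting, example (2) matches through the path that treats "the committee heads" as a noun group, while example (3) matches only through the path that splits it into a noun group and a verb group.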

Although  compactly  represented,  the  numerous 
analyses  that  can  result  from  lattice-based  process¬ 
ing  still  require  some  methods  for  pruning  and  path 
selection.  To  date,  we  have  implemented  and  eval¬ 
uated  a  variety  of  strategies.  Thus  far,  the  results 
of  these  experiments,  as  measured  by  F-score  on  the 
MUC-6  task,  have  been  somewhat  mixed.  A  typi¬ 
cal  experiment  will  yield  about  one  point  of  gain  in 
F-score;  as  expected,  recall  generally  climbs  with  a 
smaller  sacrifice  in  precision.  We  plan  to  do  further 
experimentation  in  the  future. 

3.2  Improvements  to  Coreference 
Resolution 

We  have  implemented  various  high-precision  and 
largely  domain-independent  incremental  extensions 
to  the  coreference  resolution  module. 

Delayed  Resolution  in  the  Lattice  System 

The implementation of the lattice-based system
opened up the possibility of addressing several coreference
issues that could not be cleanly addressed
within the nonlattice system. The first is a catch-22
which results from a need to perform coreference
resolution  both  before  and  after  the  domain  phase 
level  of  analysis.  (Recall  that  coreference  resolution 
comes  before  the  domain  phase.)  We  illustrate  with 
example  (4). 

(4)  Analysts  have  been  expecting  IBM  to  announce 
some  changes.  In  fact,  today  they  named  John 
Smith  as  president. 

Let  us  assume,  plausibly  enough,  that  the  domain 
phase  contains  a  pattern  of  the  following  sort,  which 


will  match  the  second  sentence  if  the  referent  of 
“they”  is  a  company  (in  this  case  IBM). 

Event := Company named Person as Position

Coreference  resolution  must  necessarily  apply  be¬ 
fore  the  domain  phase,  since  the  pattern  interpreter 
needs  to  know  whether  the  denotation  of  the  sub¬ 
ject  (the  referent  of  “they”)  is  a  company.  Unfortu¬ 
nately,  in  this  particular  case  the  coreference  module 
is  likely  to  choose  “analysts”  as  the  referent,  since  it 
occupies  the  subject  position  of  the  preceding  clause, 
which  usually  indicates  a  higher  degree  of  salience 
than  the  object  position  that  “IBM”  occupies.  In¬ 
tuitively,  however,  just  the  fact  that  the  system  has 
the  aforementioned  pattern  suggests  that  one  would 
expect  a  company  to  be  situated  at  that  point  in 
the  context  of  the  clause.  Thus,  there  is  reason  to 
want  coreference  to  apply  after  it  has  access  to  that 
pattern,  that  is,  after  domain-phase  processing. 

A similar problem occurs with respect to intrasentential
coreference constraints. Consider the sentence

(5)  John  Smith  removed  him  from  the  CEO  post. 

Intrasentential  constraints,  dictated  by  the  syntac¬ 
tic  structure  of  the  sentence,  tell  us  that  “him”  can¬ 
not  refer  to  John  Smith.  However,  only  the  domain 
phase  has  a  notion  of  sentence-level  syntax,  so  the 
system  has  no  way  of  knowing  of  the  applicability 
of  this  constraint  given  that  the  coreference  module 
operates  before  this  phase. 

The  lattice-based  system  provides  a  way  to  in¬ 
corporate  and  preserve  ambiguities  through  the  do¬ 
main  phase,  and  thus  offers  an  opportunity  to  ad¬ 
dress  these  problems.  Instead  of  selecting  only  the 
most  preferred  referent  for  a  referential  expression, 
the  coreference  module  takes  the  set  of  alternatives 
and  writes  arcs  for  each  onto  the  lattice  in  place 
of  the  referential  phrase,  including  relative  levels  of 
preference.  This  lattice  then  serves  as  input  to  the 
domain  phase,  as  before,  at  which  point  the  above 
constraints  can  be  enforced.  In  the  case  of  exam¬ 
ple  (4),  for  instance,  the  path  in  which  “they”  is 
rewritten  as  “analysts”  will  not  result  in  a  success¬ 
ful  match,  whereas  the  path  in  which  it  is  rewritten 
as  “IBM”  will.  Alternatively,  if  both  potential  ref¬ 
erents  were  company  names,  then  the  one  that  the 
coreference  module  considers  to  be  most  preferred 
will  be  selected. 
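The delayed-resolution behavior described above can be sketched as a filter followed by a preference ranking. This is a simplified illustration only: the `Candidate` class, the numeric preference scores, the type names, and `resolve_in_domain_phase` are assumptions, and the real module writes alternatives onto lattice arcs rather than into a list.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    entity_type: str
    preference: float  # higher means more preferred by the coreference module


def resolve_in_domain_phase(candidates, required_type):
    """Pick the most preferred candidate whose type satisfies the domain pattern."""
    viable = [c for c in candidates if c.entity_type == required_type]
    return max(viable, key=lambda c: c.preference, default=None)


# Example (4): "analysts" is more salient, but the pattern requires a company.
they = [
    Candidate("analysts", "person-group", 0.7),
    Candidate("IBM", "company", 0.3),
]
print(resolve_in_domain_phase(they, "company"))
```

When several candidates satisfy the pattern, the preference assigned by the coreference module decides among them, as in the text.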

Contributions  of  the  coreference  and  lattice  com¬ 
ponents  were  independently  measured  on  an  earlier 
baseline  system.  We  observed  that  both  components 
increase  the  recall  and  precision  independently,  with 
the  recall  (error  reduction  of  8%  to  10%)  much  more
affected  than  the  precision  (error  reduction  of  1%  to 
2%).  The  coreference-lattice  combination  leads  to 
an  even  greater  increase  in  the  recall  (13%  error  re¬ 
duction)  but  less  impact  on  the  precision  (less  than 
1%  error  reduction).  After  this  evaluation,  we  real¬ 
ized  that  the  logic  of  the  coreference-lattice  integra¬ 
tion  was  incomplete,  because  of  certain  destructive 
operations  that  should  not  be  maintained  in  a  strat¬ 
egy  in  which  alternatives  are  preserved.  We  expect 
an  even  greater  performance  impact  when  integra¬ 
tion  is  completed. 

Extensions to Coverage. We also implemented
extensions  to  the  coverage  of  the  coreference  module. 

First,  we  implemented  a  module  for  resolving  im¬ 
plicit  arguments  for  certain  relational  nouns.  Re¬ 
lational  nouns  are  those  whose  denotations  are  de¬ 
termined  in  association  to  a  possessor  entity.  In 
the  business-political  domain,  position  nouns  such 
as  CEO  and  vice  president  are  relational,  associated 
with  (sometimes  implicit)  organizations  at  which 
these  positions  exist.  As  a  first  step,  we  added  a 
mechanism  for  resolving  implicit  organizations  for 
position  expressions  similar  to  those  for  pronoun  res¬ 
olution.  This  addition  increased  the  IE  recall  by  al¬ 
most  a  point  (0.85%),  a  nontrivial  gain  for  a  change 
of  relatively  limited  scope. 

We  also  added  a  resolution  routine  for  definite 
temporal  expressions.  Indexicals  such  as  “today”, 
“next  week”,  “last  Monday”,  and  “10  years  ago”  are 
resolved  with  respect  to  the  document  date.  Par¬ 
tial  temporal  expressions  such  as  “Friday”  and  “the 
23rd”  are  resolved  with  respect  to  the  combination 
of  the  closest  verb  tense  and  the  salient  date  in  the 
global  or  local  context.  The  globally  salient  date  is 
the  document  date,  whereas  the  locally  salient  date 
is  the  most  recent  date  mentioned  in  the  text.  The 
performance  of  the  date  resolution  routine  was  eval¬ 
uated  with  eight  training  articles  containing  a  total 
of 53 definite date expressions. Of the 43 expressions
within the currently intended coverage, 37 were correctly
resolved. This corresponds to 69.8% recall
(37/53) and 86.0% precision (37/43).
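As an illustration of how a partial temporal expression might be anchored, the sketch below resolves a bare weekday name against a salient date, using verb tense to choose the direction. The rule that past tense selects the nearest matching day on or before the anchor (and any other tense the nearest on or after it) is an assumption made for this sketch, as are the function names; the text does not specify the exact resolution rules.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def salient_date(doc_date, mentioned_dates):
    """Use the most recently mentioned date if any, else the document date."""
    return mentioned_dates[-1] if mentioned_dates else doc_date


def resolve_weekday(name, anchor, tense):
    """Resolve a bare weekday name relative to the anchor date (assumed rule)."""
    target = WEEKDAYS.index(name.lower())
    if tense == "past":
        return anchor - timedelta(days=(anchor.weekday() - target) % 7)
    return anchor + timedelta(days=(target - anchor.weekday()) % 7)


doc_date = date(1996, 1, 3)  # hypothetical document date
anchor = salient_date(doc_date, [])
print(resolve_weekday("Friday", anchor, "past"))
print(resolve_weekday("Friday", anchor, "future"))
```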

Fragment Analysis. After we observed the effectiveness
of implicit argument resolution as described
above, we added a domain-specific treatment
of what we call fragment analysis. Fastus often
finds fragments of domain patterns in texts because
of insufficient domain coverage, an inevitable limitation
given the ability of natural language to express
the same content in many different, often unpredictable
surface realizations. Consider the following
example.


(6)  John  Doe,  who  is  known  for  his  “my  way  or  the 
highway”  management  style,  but  who  nonethe¬ 
less  receives  rave  reviews  from  industry  insiders, 
even  his  enemies,  was  named  president  of  IBM. 

In  this  case,  Fastus  is  likely  to  match  the  fragment 
“was  named  president  of  IBM,”  outputting  a  tran¬ 
sition  with  a  position  and  organization.  Unfortu¬ 
nately,  given  the  intervening  material  between  this 
fragment  and  the  subject,  it  will  also  most  likely  fail 
to  link  the  transition  to  the  incoming  person,  John 
Doe.  The  fragment  analysis  code  corrects  this  by  in¬ 
specting  each  transition  created  for  a  sentence,  and, 
assuming  that  a  substantial  but  incomplete  template 
is  found,  attempts  to  locate  candidates  from  the 
surrounding  discourse  context  to  replace  the  empty 
slots. The overall effect, specifically of making partial
domain event templates more complete, is similar
to that of the merging phase. The difference is
that while merging combines two or more partially
filled domain events, fragment analysis
fills empty slots of each domain event with recently
mentioned entities even if they are not associated
with extracted events. We compared the effects of
fragment  analysis  and  merging  on  the  overall  score 
using  the  100  message  MUC-6  training  set.  The  re¬ 
sult  is  shown  in  Table  1;  fragment  analysis  alone 
performed  better  than  merging  alone,  with  the  two 
together  performing  the  best. 
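The slot-filling step of fragment analysis can be sketched as follows. The flat dictionary representation, the `min_filled` threshold used to decide that a template is "substantial", and the function name `fill_fragment` are assumptions introduced for illustration; the actual code operates on transition templates and applies domain-specific checks not shown here.

```python
def fill_fragment(template, recent_entities, min_filled=2):
    """Fill empty slots of a substantial but incomplete template from context.

    recent_entities is a list of (slot type, text) pairs in order of mention.
    """
    filled = [slot for slot, value in template.items() if value is not None]
    result = dict(template)
    if len(filled) < min_filled:
        return result  # too sparse to count as a substantial fragment
    for slot, value in template.items():
        if value is None:
            # Prefer the most recently mentioned entity of the right type.
            for entity_type, text in reversed(recent_entities):
                if entity_type == slot:
                    result[slot] = text
                    break
    return result


# Example (6): the fragment "was named president of IBM" lacks its person.
fragment = {"person": None, "position": "president", "organization": "IBM"}
context = [("organization", "IBM"), ("person", "John Doe")]
print(fill_fragment(fragment, context))
```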

An Analysis of WordNet. Sanda Harabagiu, a
former post-doctoral fellow at SRI, performed an
analysis of how WordNet might be used to improve
coreference resolution, particularly by exploiting
hypernym and synonym information. Using the MUC-6
coreference training messages as her corpus, she
found that 60% of the coreference examples fall
into categories in which WordNet is of no potential
use: cases of identity between strings (e.g., “a
company ... the company”) comprised 42.3% of the
examples, and cases in which coreference is indicated
by syntactic configuration (e.g., appositives, as
in “John Smith, president of Acme Widgets”)
comprised 18.27% of the examples.

Reference  involving  a  synonym  relation  made  up 
8.33%  of  the  examples.  Of  these,  3.1%  were  syn¬ 
onyms  in  WordNet,  such  as  “bill”  and  “measure”. 
However,  5.23%  were  not  in  WordNet.  Some  of  these 
cases  one  could  imagine  being  in  such  a  knowledge 
source,  such  as  “business”  and  “company”;  it  just 
so  happens  that  they  are  not.  On  the  other  hand, 
there are also more difficult cases, such as “IBM”
and “wounded computer giant”, for which no knowledge
base is likely to contain a relation.

Merging  Fragment Analysis  Precision  Recall  F-score
Off      Off                71         42      52.60
Off      On                 68         48      56.10
On       Off                65         52      57.67
On       On                 64         55      59.17

Table 1: Contributions of Fragment Analysis and Merging

Reference involving a hypernym relation made up
11.0% of the cases. Of these, 3.7% were in WordNet,
such as “quarter” and “period”, and “chairman”
and “officer”. The other 7.3% were not in
WordNet.  Again,  there  were  cases  which  one  could 
imagine  being  there,  such  as  “automaker”  and  “com¬ 
pany”.  Others,  however,  such  as  “Clinton  officials” 
and  “Clinton  camp”,  are  not  likely  to  be  found  in 
any  such  knowledge  base. 

The remaining cases were often more difficult,
many involving metonymy.

3.3  Learning  Merging  Strategies 

Early  in  the  project,  we  performed  an  analysis  of  the 
errors  Fastus  made  on  a  subset  of  the  MUC-6  devel¬ 
opment  corpus.  The  majority  of  the  errors  indicted 
merging  at  least  in  part,  suggesting  that  merging 
improvements  had  a  potential  for  high  payoff. 

The existing Fastus merging algorithm is quite
simple: it attempts to merge newly created templates
with previous ones, starting with the most
recent. Templates are merged when they are unifiable
in accordance with any prespecified constraints.
Despite  its  simplicity,  the  algorithm  has  proven  to 
be  fairly  successful.  Nonetheless,  it  is  quite  possi¬ 
ble  that  other  merging  strategies  could  yield  better 
results. 
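The baseline strategy just described can be sketched as slot-wise unification over transition templates, tried against earlier templates from most recent to least recent. The nested dictionary representation and the function names are assumptions for illustration, and the prespecified constraints mentioned above are omitted.

```python
def unify_slots(a, b):
    """Unify two slot dictionaries; an empty (None) slot unifies with anything."""
    merged = {}
    for slot in a.keys() | b.keys():
        x, y = a.get(slot), b.get(slot)
        if x is not None and y is not None and x != y:
            return None  # conflicting fillers: not unifiable
        merged[slot] = x if x is not None else y
    return merged


def unify(t1, t2):
    """Unify the start and end states of two transition templates."""
    result = {}
    for state in ("start", "end"):
        merged = unify_slots(t1[state], t2[state])
        if merged is None:
            return None
        result[state] = merged
    return result


def add_template(history, new):
    """Merge new with the most recent unifiable template, else append it."""
    for i in range(len(history) - 1, -1, -1):
        merged = unify(history[i], new)
        if merged is not None:
            history[i] = merged
            return history
    history.append(new)
    return history


fig1 = {"start": {"person": None, "position": "president", "organization": "ABC Corp."},
        "end": {"person": "John Smith", "position": "president", "organization": "ABC Corp."}}
fig2 = {"start": {"person": "Mike Jones", "position": None, "organization": None},
        "end": {"person": "John Smith", "position": None, "organization": None}}
print(add_template([fig1], fig2))
```

Applied to the templates of Figures 1 and 2, this produces a single template with the contents of Figure 3.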

There  are  two  ways  in  which  one  might  attempt 
to  identify  such  strategies.  First,  one  could  perform 
data  analyses  to  identify  good  merging  principles, 
handcode  them,  and  test  the  results.  Alternatively, 
one  could  attempt  to  have  merging  strategies  be 
acquired  by  the  system  automatically,  using  some 
training  mechanism.  We  attempted  both  of  these, 
which  we  discuss  in  turn. 

Data Analyses and Experimentation. The
first action we took was to perform an extensive
analysis  of  merging  results.  We  developed  detailed 
mechanisms  for  tracing  merging  behavior  and  dis¬ 
tributed  transcripts  among  several  project  partici¬ 
pants.  In  analyzing  these,  we  identified  a  variety  of 
constraints  which  appeared  to  be  extremely  reliable, 
in  particular,  characteristics  of  templates  that  were 
almost  always  correlated  with  incorrect  merges. 

One  by  one,  these  constraints  were  implemented 


and  tested.  In  each  case,  end-to-end  performance 
on  the  scenario  template  task  either  remained  the 
same  or  decreased  slightly.  In  no  case  did  we  get  a 
nontrivial  increase  in  performance. 

This  was  rather  puzzling  and  frustrating,  and 
highlighted  some  of  the  problems  with  handcoding 
system  improvements.  For  one,  the  processes  of  data 
analysis,  system  coding,  and  testing  are  labor  inten¬ 
sive.  One  cannot  try  all  possible  alternative  sets  of 
constraints  one  might  consider,  so  one  can  never  be 
sure  that  other,  unattempted  constraints  would  not 
have  fared  better.  Second,  it  could  be  that  we  were 
being misled by the relatively small data sets
that  we  were  analyzing  by  hand.  Thus,  we  began 
considering  other  paradigms  for  identifying  better 
merging  strategies. 

There  were  also  other,  longer-term  considerations 
for  moving  away  from  handcoding  merging  improve¬ 
ments.  For  one,  the  optimal  merging  strategy  is 
highly dependent on the quality of the input it receives,
which is constantly evolving in any realistic
development setting, thus requiring continual
re-experimentation. Changes that improve performance
at one point in system development could
potentially decrease performance at another time, or
vice versa. Second, a general goal of IE research is
to have systems that can be trained for new applications
long after the system developers have ceased to
be involved, which precludes experimentation by hand.

These  considerations  motivate  research  to  deter¬ 
mine  if  merging  strategies  can  be  learned  automat¬ 
ically.  There  are  several  different  types  of  learning, 
including  supervised,  unsupervised,  and  an  area  in 
between  which  one  might  call  indirectly  supervised. 
We  have  performed  experiments  using  all  three  types 
of technique, which we describe below.¹

¹ The work reported on here, also discussed in Kehler
(1998), concerns learning merging strategies in support
of the scenario template task of MUC-6 as described in
Section 2. While we are unaware of any other reported
research on this task, other work has addressed other
MUC-style tasks. For instance, Kehler (1997) describes
a probabilistic approach to entity-level merging that
outperforms several baseline metrics. Also, researchers at
BBN (Ralph Weischedel, TIPSTER 18-month meeting)
report on learned merging strategies achieving good
performance on the less complex template entity and
template relation tasks in MUC-7, although no comparison
with a similar hand-coded system was provided.

Supervised Methods. In our first set of experiments,
we took the approach most commonly pursued
in the computational linguistics literature,
namely supervised learning. Supervised methods require
a set of training data that the learning algorithm
can consult in constructing its model. For our
initial experiments, we ran the 100 MUC-6 training
messages through Fastus and wrote out feature
signatures for the 534 merges that the system performed.
The feature signatures were created by asking
a set of 50 questions about the context in which
the proposed merge is taking place, referencing the
content of the two templates and/or the distance
between the phrases from which each template was
created. Some example questions are:

•  SUBSUMED?: true if the contents of one template
completely subsume the contents of the
other.

•  UNNAMED-REFERENCES?: true if either
transition has a slot filled with an object lacking
a proper name, e.g., “an employee” in the
person slot. While these objects can merge
with other (perhaps named) entities of the same
type, in general they should not.

•  LESS-THAN-700-CHARS?: true if the phrases
from which the templates are created are less
than 700 characters apart in the text.

After the feature signatures were written, we examined
the texts and manually encoded a key for each.

We attempted two approaches to classifying
merges using this corpus as training data. The
first was to grow a classification tree in the style of
Breiman et al. (1984). At each node, the algorithm
asks each question and selects the one resulting in
the purest split of the data. Entropy was used as the
measure of node purity. In the second set of experiments,
we used the approach to maximum entropy
modeling described by Berger et al. (1996). The two
possible values for each of the same 50 questions (i.e.,
yes or no) were paired with each of the two possible
outcomes for merging (i.e., correct merge or not)
to create a set of feature functions, or features for
short, which were used in turn to define constraints
on a probabilistic model. We used the learned maximum
entropy model as a classifier by considering
any merge with a probability strictly greater than
0.5 to be correct, and otherwise incorrect.


Out  of  the  available  set  of  questions,  each  ap¬ 
proach  selects  only  those  that  are  most  informative 
for  the  classifier  being  developed.  In  the  case  of  the 
decision  tree,  questions  are  selected  based  on  how 
well  they  split  the  data.  In  the  case  of  maximum 
entropy,  the  algorithm  approximates  the  gain  in  the 
model’s  predictiveness  that  would  result  from  im¬ 
posing  the  constraints  corresponding  to  each  of  the 
existing  inactive  features,  and  selects  the  one  with 
the  highest  anticipated  payoff.  One  potential  advan¬ 
tage  of  maximum  entropy  is  that  it  does  not  split 
data  like  a  decision  tree  does,  which  may  prove  im¬ 
portant  as  training  sets  will  necessarily  be  limited  in 
their  size. 
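For the decision tree side, the question-selection step can be sketched as choosing the question whose yes/no split leaves the lowest weighted entropy. This is a generic illustration of entropy-based splitting under assumed data structures (a feature signature as a dictionary from question name to boolean, paired with a correct/incorrect key); the four toy examples are invented, and the maximum entropy feature selection procedure is not shown.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        result -= p * math.log2(p)
    return result


def split_entropy(examples, question):
    """Weighted entropy of the labels after splitting on one question."""
    yes = [label for sig, label in examples if sig[question]]
    no = [label for sig, label in examples if not sig[question]]
    return (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(examples)


def best_question(examples, questions):
    """Select the question that yields the purest split of the data."""
    return min(questions, key=lambda q: split_entropy(examples, q))


# Invented toy signatures: (answers to questions, whether the merge was correct).
examples = [
    ({"SUBSUMED?": True, "LESS-THAN-700-CHARS?": True}, True),
    ({"SUBSUMED?": True, "LESS-THAN-700-CHARS?": False}, True),
    ({"SUBSUMED?": False, "LESS-THAN-700-CHARS?": True}, False),
    ({"SUBSUMED?": False, "LESS-THAN-700-CHARS?": False}, False),
]
print(best_question(examples, ["SUBSUMED?", "LESS-THAN-700-CHARS?"]))
```

Because each split partitions the training examples, deeper nodes see progressively fewer of them, which is the data-fragmentation concern raised above for limited training sets.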

In  our  preliminary  evaluations,  we  used  two-thirds 
of  our  annotated  corpus  as  a  training  set  (356  exam¬ 
ples),  and  the  remaining  one-third  as  a  test  set  (178 
examples).  We  ran  experiments  using  three  different 
such  divisions,  using  each  example  twice  in  a  train¬ 
ing  set  and  once  in  a  test  set.  In  each  case  the  maxi¬ 
mum  entropy  classifier  chose  features  corresponding 
to  either  6  or  7  of  the  available  questions,  whereas 
the  decision  tree  classifier  asked  anywhere  from  7  to 
14  questions  to  get  to  the  deepest  leaf  node.  In  each 
case  there  was  considerable,  but  not  total,  overlap  in 
the  questions  utilized.  Adding  the  errors  from  the 
three  evaluations  together,  the  decision  tree  made 
34  errors  (out  of  a  possible  534),  in  which  13  correct 
merges  were  classified  as  incorrect  and  21  incorrect 
merges  were  classified  as  correct.  The  maximum  en¬ 
tropy  classifier  made  a  total  of  31  errors,  in  which 
14  correct  merges  were  classified  as  incorrect  and 
17  incorrect  merges  were  classified  as  correct.  This 
is  compared  to  a  total  of  139  errors  out  of  the  534 
merges  that  the  current  merger  made  according  to 
the  annotations. 
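Expressed as error rates over the 534 annotated merges, these counts correspond to roughly 6.4% for the decision tree, 5.8% for the maximum entropy classifier, and 26.0% for the existing merger. The short computation below simply reproduces that arithmetic from the figures reported above; the variable and function names are illustrative.

```python
TOTAL_MERGES = 534  # annotated merges; 356 training + 178 test per division


def error_rate(correct_rejected, incorrect_accepted, total=TOTAL_MERGES):
    """Fraction of merges misclassified, summed over the three test sets."""
    return (correct_rejected + incorrect_accepted) / total


tree_rate = error_rate(13, 21)      # decision tree: 34 errors
maxent_rate = error_rate(14, 17)    # maximum entropy: 31 errors
baseline_rate = 139 / TOTAL_MERGES  # existing merger, per the annotations

for name, rate in [("tree", tree_rate), ("maxent", maxent_rate),
                   ("baseline", baseline_rate)]:
    print(f"{name}: {rate:.1%}")
```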

These  results  may  appear  to  be  positive,  as  it 
would  seem  that  both  methods  found  some  reliable 
information  on  which  to  make  classifications.  How¬ 
ever,  our  goal  here  was  to  improve  end-to-end  per¬ 
formance  on  the  scenario  template  task,  and  thus  we 
wanted  to  know  how  much  of  an  impact  these  im¬ 
proved  merging  strategies  have  on  that  performance. 
Therefore,  we  replaced  the  existing  Fastus  merg¬ 
ing  algorithm  with  two  more  discriminating  mergers, 
each  directed  by  one  of  our  learned  classifiers.  The 
first  version  consulted  the  decision  tree  and  merged 
only  when  the  example  was  classified  as  correct.  The 
second  version  did  the  same  using  the  maximum  en¬ 
tropy  classifier.  For  these  experiments,  the  two  mod¬ 
els  were  trained  using  the  entire  set  of  534  examples. 

As  we  were  still  experimenting  at  this  point,  we 
were  not  ready  to  perform  an  evaluation  using  our 
set of blind test messages. As an information gathering
experiment, we applied Fastus using the new
mergers to the corpus of messages that produced
the training data. We would of course expect these
experiments to yield better results than when applied
to unseen messages. Nonetheless, the results
were humbling: both experiments failed to improve
the  performance  of  the  overall  system,  and  in  fact 
degraded  it  slightly.  Generally,  a  point  of  precision 
was  gained  at  the  expense  of  a  point  or  two  of  recall. 

Clearly,  there  is  a  rift  between  what  one  might 
consider  to  be  good  performance  at  discriminating 
correct and incorrect merges based on human judgments,
and the effect these decisions have on overall
performance. Because the baseline Fastus algorithm
merges too liberally, using the classifiers causes
many of the incorrect merges that were previously
performed to be blocked, at the expense of blocking
a smaller number of correct merges. Thus, it is possible
that the correct merges the system performs help
its  end-to-end  performance  much  more  than  incor¬ 
rect  merges  hurt  it.  For  instance,  it  may  be  that  cor¬ 
rect  merges  often  result  in  well-populated  templates 
that  have  a  marked  impact  on  performance,  whereas 
incorrect  merges  may  often  add  only  one  incorrect 
slot  to  an  otherwise  correct  template,  or  even  result 
in  templates  that  do  not  pass  the  threshold  for  ex- 
tractability  at  all.  In  fact,  in  certain  circumstances 
incorrect  merges  can  actually  help  performance,  if 
two  incorrect  templates  that  would  produce  incor¬ 
rect  end  results  are  unified  to  become  one. 

In  any  case,  it  should  be  clear  that  improved  per¬ 
formance  on  an  isolated  subcomponent  of  an  IE  sys¬ 
tem,  as  measured  against  human  annotations  for 
that  subcomponent,  does  not  necessarily  translate 
to  improved  end-to-end  system  performance.  Add 
this  to  the  cost  of  creating  this  annotated  data  - 
which  will  continually  become  obsolete  as  the  up¬ 
stream  Fastus  modules  undergo  development  -  and 
it  becomes  clear  that  we  need  to  look  to  other  meth¬ 
ods  for  learning  merging  mechanisms. 

Unsupervised Methods. Naturally, the main alternatives
to supervised methods are unsupervised
methods.  We  consider  replacing  our  merging  algo¬ 
rithm  with  one  that  performs  an  unsupervised  clus¬ 
tering  of  the  templates  and  merges  the  templates  in 
each  cluster.  Of  course,  we  will  not  know  a  priori 
how  many  clusters  there  are,  that  is,  how  many  tem¬ 
plates  we  should  be  left  with  when  we  are  finished.  A 
method  that  does  not  require  such  knowledge  is  Hi¬ 
erarchical  Agglomerative  Clustering  (HAC)  (Duda 
and  Hart,  1973;  Everitt,  1980,  inter  alia). 

The HAC algorithm is conceptually straightforward.
Given a set of examples, the algorithm begins
by assigning each to its own cluster. A predetermined
similarity metric is then applied to each pairwise
combination of clusters, and the most similar
pair is combined. The process is iterated until no pair
of clusters has a similarity that exceeds a preset
threshold.

Our  application  of  clustering  is  somewhat  different 
from  many  problems  to  which  clustering  has  been 
applied.  For  one,  our  clusters  will  always  have  only 
one  member,  since  templates  are  merged  upon  clus¬ 
tering.  Issues  with  how  to  compute  similarity  be¬ 
tween  two  nonsingleton  sets  of  data  points  are  there¬ 
fore  avoided.  Furthermore,  our  notion  of  similarity  is 
nonstandard.  Usually,  similar  examples  are  distinct, 
but  have  properties  that  are  “close”  to  each  other  in 
some  space.  Here,  similarity  is  meant  to  measure  the 
likelihood  that  the  two  templates  are  incomplete  de¬ 
scriptions  of  the  same  complex  of  eventualities  (i.e., 
the same transition), although the templates themselves
may look very different.
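
The merge-on-cluster variant of HAC described here can be sketched as follows. This is a minimal illustration and not the Fastus implementation: templates are modeled as flat slot-to-value dictionaries, and the `similarity` and `merge` functions are simple stand-ins that we assume for the example.

```python
from itertools import combinations

def merge(a, b):
    """Unify two templates (slot -> value dicts); assumes they do not conflict."""
    merged = dict(a)
    merged.update(b)
    return merged

def similarity(a, b):
    """Toy metric: share of agreeing slots; 0 if any shared slot conflicts."""
    shared = set(a) & set(b)
    if any(a[s] != b[s] for s in shared):
        return 0.0
    return len(shared) / max(len(set(a) | set(b)), 1)

def cluster_and_merge(templates, threshold, sim=similarity):
    """Repeatedly merge the most similar pair until no pair exceeds the threshold."""
    templates = list(templates)
    while len(templates) > 1:
        (i, j), best = max(
            (((i, j), sim(templates[i], templates[j]))
             for i, j in combinations(range(len(templates)), 2)),
            key=lambda pair: pair[1],
        )
        if best <= threshold:
            break
        merged = merge(templates[i], templates[j])
        # Each cluster stays a single template, so no set-to-set similarity is needed.
        templates = [t for k, t in enumerate(templates) if k not in (i, j)]
        templates.append(merged)
    return templates

result = cluster_and_merge(
    [
        {"person": "John Smith", "post": "president"},
        {"person": "John Smith", "org": "ABC Corp."},
        {"person": "Mike Jones"},
    ],
    threshold=0.25,
)
```

Because every merge replaces two templates with one, the number of final templates falls out of the threshold and does not have to be fixed in advance.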

We  performed  some  informal  experiments  in 
which  we  intuited  a  similarity  metric,  assigning 
weights  to  a  subset  of  the  questions  that  we  had 
defined  for  the  supervised  learning  experiments.  For 
instance,  templates  that  were  created  from  phrases 
close  to  each  other  in  the  text  and  that  overlapped  in 
content  received  high  similarity,  whereas  those  that 
were  far  apart  and  did  not  overlap  received  low  sim¬ 
ilarity.  Instead  of  merging  incrementally  as  in  the 
supervised  learning  experiments,  pattern  matching 
was first applied to the entire text, and the resulting
templates  were  clustered  and  merged  until  no  pair 
of  templates  passed  a  preset  similarity  threshold. 

Running  the  system  over  the  MUC-6  development 
set  yielded  results  similar  to  our  experiments  using 
the  supervised  mergers.  We  did  not  find  this  to  be 
particularly  surprising;  for  instance,  the  mediocre 
results  could  be  attributable  to  the  similarity  metrics 
not  being  very  good. 

We  did  not  push  this  approach  any  further,  be¬ 
cause  it  is  still  lacking  with  respect  to  one  of  our 
goals  for  pursuing  learning  strategies.  While  it  ad¬ 
dresses  the  problem  of  requiring  annotated  training 
data,  it  does  not  address  the  fact  that  the  optimal 
merging  strategy  is  inherently  dependent  on  its  in¬ 
put.  If  we  encode  a  similarity  metric  for  clustering 
and  keep  it  fixed,  we  are  left  with  only  a  single  degree 
of  freedom  -  the  similarity  threshold  at  which  to  halt 
the  clustering  process.  While  this  may  yield  some 
leverage  (for  instance,  good  input  to  the  merger  may 
call  for  a  high  threshold,  whereas  bad  input  may  call 
for  a  lower  threshold),  it  will  certainly  be  too  inflex¬ 
ible  in  the  general  case. 

In sum, several factors could influence the likelihood
of a potential merge within a particular application,
and it therefore seems that something tied to
the  application  needs  to  guide  the  learning  process. 

Indirectly Supervised Methods. When developing
an IE system, one typically encodes (or is
given) a moderate-size set of end-to-end development
keys for a set of sample messages. These keys
need  to  be  encoded  only  once.  We  did  not  use  these 
keys  for  supervised  learning  because  of  the  difficul¬ 
ties  in  aligning  the  inaccurate  and  incomplete  inter¬ 
mediate  templates  produced  by  the  system  with  the 
(normalized)  end  results.  However,  we  can  use  the 
keys  to  evaluate  the  end  results  of  the  system,  and 
attempt  to  tune  a  merging  strategy  based  on  these 
evaluations.  After  all,  it  is  improved  end-to-end  per¬ 
formance  that  we  are  seeking  in  the  first  place. 

Thus,  we  consider  a  form  of  what  we  are  calling  in¬ 
directly  supervised  learning.  We  use  the  HAC  mech¬ 
anism  described  in  the  previous  section,  but  attempt 
to  learn  the  similarity  metric  instead  of  stating  it 
explicitly.  The  search  through  the  space  of  possi¬ 
ble  similarity  metrics  will  be  driven  by  end-to-end 
performance  on  a  set  of  training  messages. 

We  start  by  defining  a  space  of  similarity  met¬ 
rics.  In  a  preliminary  experiment,  we  used  7  of  the 
questions  that  were  used  in  the  supervised  experi¬ 
ments,  coupled  with  their  negations,  for  a  total  of 
14  questions.  These  questions  are  assigned  weights, 
either  positive  or  negative,  that  get  incorporated 
into  a  similarity  metric  when  the  question  is  true 
of a potential merge. Let $\lambda_i$ be the weights assigned
to corresponding questions $q_i$, and let the function
$f_{q_i}(t_1, t_2)$ be 1 if the question $q_i$ is true of the templates
$t_1$ and $t_2$, and 0 if not. Then the similarity
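$S(t_1, t_2)$ is given by

$$S(t_1, t_2) = \frac{\exp\left(\sum_i \lambda_i f_{q_i}(t_1, t_2)\right)}{1 + \exp\left(\sum_i \lambda_i f_{q_i}(t_1, t_2)\right)}$$
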
This  function,  which  is  adapted  from  the  form  of 
the  probability  model  used  in  the  maximum  entropy 
framework,  provides  a  similarity  measure  in  terms 
of  a  probability. 
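
A minimal sketch of such a similarity function follows. It assumes the logistic form that a two-outcome maximum entropy model takes, which is our reading of the description rather than a confirmed detail, and the question names and weights are hypothetical.

```python
import math

def similarity(weights, answers):
    """Map the summed weights of the true questions to a value in (0, 1).

    weights: question name -> weight (positive or negative)
    answers: question name -> bool, i.e. whether the question is true
             of the candidate pair of templates
    """
    score = sum(w for q, w in weights.items() if answers.get(q, False))
    # Numerically stable evaluation of exp(score) / (1 + exp(score)).
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)

# Hypothetical questions, each paired with its negation as in the text.
weights = {"overlap": 1.5, "no_overlap": -1.5, "nearby": 0.8, "not_nearby": -0.8}
close_pair = {"overlap": True, "nearby": True}
distant_pair = {"no_overlap": True, "not_nearby": True}
```

With no true questions the score is zero and the similarity is 0.5, so positive weights push a candidate merge above that midpoint and negative weights push it below.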

We used an annealing strategy to tune the weights
$\lambda_i$. The algorithm begins by processing the 100-message
MUC-6 development set, usually with a
randomly  selected  initial  configuration  that  estab¬ 
lishes  a  baseline  F-score.  The  algorithm  then  iter¬ 
ates,  selecting  some  of  the  questions  at  random  (per¬ 
haps  just  one,  perhaps  all  of  them)  and  permuting 
their  weights  by  a  random  amount,  either  positive  or 
negative.  The  system  is  then  rerun  over  the  training 
set and the F-score measured. Any permutation resulting
in an F-score that is strictly greater than the
current baseline is adopted as the new baseline. To
stay  out  of  local  maxima,  a  permutation  leading  to  a 
decrease  in  performance  may  also  be  adopted.  This 
is  the  annealing  part  -  such  negative  permutations 
are  accepted  with  a  probability  that  is  proportional 
to  a  steadily  decreasing  measure  of  ‘temperature’, 
and  inversely  proportional  to  the  magnitude  of  the 
decrement  in  performance.  Thus,  permutations  that 
decrease  performance  slightly  in  early  stages  of  the 
search  are  likely  to  be  adopted,  whereas  permuta¬ 
tions  that  decrease  performance  either  significantly 
or  in  later  stages  of  the  search  are  not. 
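
The search loop can be sketched roughly as below. The linear cooling schedule, the step size, and the exact acceptance formula (temperature divided by the size of the drop, capped at 1) are assumptions made for illustration; `evaluate` stands in for rerunning the system over the training set and measuring the F-score.

```python
import random

def anneal(weights, evaluate, iterations=300, t0=1.0, step=0.5, seed=0):
    """Tune question weights by random perturbation with annealed acceptance."""
    rng = random.Random(seed)
    current = dict(weights)
    current_f = evaluate(current)
    best, best_f = dict(current), current_f
    questions = sorted(current)
    for it in range(iterations):
        temperature = t0 * (1.0 - it / iterations)  # steadily decreasing
        candidate = dict(current)
        # Perturb a random subset of the weights (possibly one, possibly all).
        for q in rng.sample(questions, rng.randint(1, len(questions))):
            candidate[q] += rng.uniform(-step, step)
        f = evaluate(candidate)
        if f > current_f:
            accept = True  # strict improvements are always adopted
        elif f < current_f:
            # Likelier when the temperature is high and the drop is small.
            accept = rng.random() < min(1.0, temperature / (current_f - f))
        else:
            accept = False
        if accept:
            current, current_f = candidate, f
            if f > best_f:
                best, best_f = dict(candidate), f
    return best, best_f

# Toy objective standing in for end-to-end F-score (hypothetical).
target = {"q1": 1.0, "q2": -1.0}
def toy_f(w):
    return -sum((w[q] - target[q]) ** 2 for q in w)

start = {"q1": 0.0, "q2": 0.0}
best, best_f = anneal(start, toy_f)
```

In the real setting each call to `evaluate` is a full run over the training messages, which is why the number of iterations is kept modest.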

The  results  of  one  of  several  experiments  are 
shown  in  Figure  4.  The  search  began  with  an  initial 
similarity  metric  achieving  an  F-score  of  58.83,  and 
continued  for  300  iterations.  A  low  F-score  of  57.70 
was achieved early, in iteration 10. The best metrics
considered  yielded  an  F-score  of  59.80. 

Obviously,  and  somewhat  surprisingly,  this  graph 
is  practically  flat.  On  one  hand,  it  is  unfortunate 
that  there  aren’t  higher  high  points:  The  learner 
was not able to leverage the available features to acquire
a much better merging strategy than the one
it  started  with.  Perhaps  even  more  surprising,  how¬ 
ever,  is  that  there  were  also  not  lower  low  points  - 
only  iteration  10  achieved  a  score  lower  than  58.  Be¬ 
cause  the  learner  was  not  given  any  bias  with  respect 
to  the  permutations  it  attempted,  some  of  those  it 
considered  were  intuitively  poor  (e.g.,  boosting  the 
weight  for  phrases  that  are  very  far  apart,  lowering 
the  weight  for  sparsely  filled  templates  with  no  over¬ 
lap).  Thus,  one  might  have  expected  certain  of  these 
to  devastate  performance,  but  none  did.  It  seems 
that  as  long  as  a  certain  amount  of  merging  is  per¬ 
formed,  it  matters  less  which  templates  are  actually 
merged,  and  in  what  order. 

Conclusions and Future Directions. In sum,
the  learned  mechanisms  were  neither  significantly 
better  nor  worse  than  a  hand-coded  merging  strat¬ 
egy.  The  inability  to  outperform  the  existing  strat¬ 
egy  could  be  attributed  to  several  facts.  We  sus¬ 
pect  that  a  major  problem  is  the  lack  of  accessi¬ 
ble,  reliable,  and  informative  indicators  for  merg¬ 
ing  decisions.  Unlike  lower-level  problems  in  natural 
language  processing  (NLP)  in  which  local  informa¬ 
tion  appears  to  bear  highly  on  the  outcome,  includ¬ 
ing,  for  instance,  part-of-speech  tagging  (Church, 
1988;  Brill,  1992,  inter  alia)  and  sense  disambigua¬ 
tion  (Yarowsky,  1994;  Yarowsky,  1995,  inter  alia), 
none  of  the  questions  we  have  formulated  appear  to 
be  particularly  indicative  of  what  effect  a  potential 
merge will have on system performance. This suggests
that more research is needed to identify ways
to access the necessary knowledge from independent
sources such as existing knowledge bases, or by mining
it from online text corpora using unsupervised
or indirectly supervised learning techniques.

Figure 4: Results of a Learning Experiment

Furthermore,  these  experiments  may  be  cause  for 
concern  about  the  nature  of  the  scoring  metric  and 
procedure  used  in  MUC-6.  All  of  the  merging  strate¬ 
gies  attempted,  both  hand-coded  and  automatically 
learned,  performed  similarly.  This  (rather  unex¬ 
pected)  result  would  suggest  that  the  scoring  mech¬ 
anisms  be  given  a  closer  look,  which  we  do  in  the 
following  section. 

3.4  Analysis  of  the  Scoring  System 

As  we  have  indicated,  the  lack  of  more  significant 
progress  in  some  of  the  foregoing  efforts  had  us  puz¬ 
zled.  Intuitively  positive  system  changes  were  not 
showing  much  effect  in  terms  of  end-to-end  perfor¬ 
mance,  nor  were  certain  intuitively  negative  changes. 
Of  course,  judgments  of  what  constitute  positive  and 
negative  changes  are  only  as  good  as  the  scoring 
mechanism  which  is  providing  the  feedback.  As  part 
of  a  related  project  at  SRI,  we  began  to  find  some 
more  concrete  evidence  that  at  times  this  feedback 
has  been  misguiding  our  efforts.  Incremental  refine¬ 
ments  in  the  system’s  output,  ones  that  should  yield 
superior  results,  nevertheless  receive  a  lower  score 
from  the  scoring  mechanism. 

The  following  text  (WSJ  article  870112-0001)  pro¬ 
vides  an  example  illustrating  this  point: 

(7)  The  board  also  named  a  three-man  executive 
committee  to  perform  the  chief  executive’s  role. 
The  three  members  are  Victor  Steele,  head  of 
the company's beverage division; Brian Baldock,
head of the leisure and health division;
and Shaun Dowling, who runs industrial operations.


Further  executive  resignations  or  dismissals  are 
widely  expected.  The  positions  of  Olivier  Roux, 
head  of  financial  planning,  and  Thomas  Ward, 
a  U.S.  attorney  who  is  a  close  aide  to  Mr. 
Saunders,  are  “open  to  question,”  one  Guinness 
source  said. 

Fastus  does  poorly  on  this  example,  for  under¬ 
standable  reasons.  It  did  not  produce  any  succes¬ 
sion  events  for  the  first  paragraph,  because  doing  so 
would  require  resolving  a  variety  of  difficult  linguis¬ 
tic  issues  lying  beyond  the  depth  of  processing  at 
which  Fastus  operates.  On  the  other  hand,  for  rea¬ 
sons  that  won’t  be  described  in  detail,  the  system 
generated  a  succession  event  from  the  second  para¬ 
graph  involving  the  position  “head  of  financial  plan¬ 
ning”,  with  four  IN-AND-OUT  templates  involving 
Roux,  Saunders,  and  two  other  people  mentioned  in 
the  article. 

While  not  much  could  be  done  for  the  first  para¬ 
graph,  we  modified  Fastus  so  that  it  would  not  pro¬ 
duce  a  template  from  the  second  paragraph.  The 
change  to  the  system  performance  on  this  message 
has  to  be  positive:  while  we  do  not  generate  any 
additional  correct  information  from  the  change,  we 
eliminated  four  predications  about  an  irrelevant  po¬ 
sition,  three  of  which  would  be  false  even  if  one  con¬ 
sidered  the  position  to  be  relevant.  Other  output 
for  this  text  was  not  affected,  so  we  would  expect 
to  observe  the  same  recall  (correct  output  was  not 
changed),  but  notably  higher  precision  from  having 
eliminated the incorrect succession event, four incorrect
IN-AND-OUTs, and two irrelevant PERSON
templates. 

In  reality,  this  change  resulted  in  a  slight  rise  in 
precision  (from  57  to  59)  and  a  dramatic  reduc¬ 
tion  in  recall  (from  50  to  33),  causing  the  F-score 
on  this  message  to  plummet  from  53.30  to  42.67. 
The  reward  for  eliminating  four  irrelevant  predica¬ 
tions  was  a  20%  drop  in  the  score.  This  result  is,  to 
say  the  least,  counterintuitive,  and  suggests  serious 
problems  in  the  ability  of  the  scoring  mechanism  to 
provide  adequate  feedback. 
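
The size of the drop can be checked with a short calculation. The sketch assumes the balanced F-measure (precision and recall weighted equally); because the precision and recall figures quoted above are rounded, the recomputed F-scores differ slightly from the reported 53.30 and 42.67.

```python
def f_score(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

before = f_score(57, 50)   # recomputed from the rounded P and R
after = f_score(59, 33)

# Relative change between the reported F-scores.
relative_drop = (53.30 - 42.67) / 53.30

print(f"F before: {before:.2f}, F after: {after:.2f}")
print(f"relative drop in reported F: {relative_drop:.1%}")
```

The harmonic mean is dominated by the smaller of the two values, so the 17-point loss in recall outweighs the 2-point gain in precision.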

We  have  several  speculations  regarding  the  causes 
of  this  behavior,  but  final  conclusions  await  a  more 
comprehensive  study.  It  should  be  obvious  in  any 
case,  however,  that  further  progress  in  IE  is  crucially 
dependent  on  these  issues  being  resolved.  While  this 
is  true  regardless  of  the  approach  one  takes  to  system 
development,  it  is  especially  so  if  we  want  to  move 
toward systems with rules and procedures that are
learned  automatically.  Successful  learning  depends 
on  the  assumption  that  learned  improvements  are 
reflected  in  the  evaluation  function;  if  this  is  not  the 
case  then  learning  is  all  but  hopeless.  Thus,  future 
research  in  IE  must  be  coupled  with  research  into 
evaluation  strategies. 

3.5  The  Zipf  Effect  on  Information 
Extraction  Applications 

A  fundamental  question  with  respect  to  IE  applica¬ 
tions  is  the  nature  of  the  Zipf  curve  relating  pattern 
development  to  improved  coverage.  In  a  given  appli¬ 
cation,  there  is  usually  a  small  set  of  patterns  which 
will  have  broad  applicability  -  that  is,  they  are  likely 
to  match  on  many  examples  in  any  given  set  of  un¬ 
seen  data.  For  instance,  a  MUC-6  pattern  designed 
to  match  the  sentence 

(8)  John  Smith  was  appointed  CEO  of  IBM. 

will  almost  certainly  match  many  other  similar  ex¬ 
amples  also.  At  the  other  end  of  the  spectrum, 
there  are  many  ‘one-of-a-kind’  examples  in  any  given 
training  corpus  for  which  the  corresponding  pattern 
is  unlikely  to  match  many  other  examples.  For  in¬ 
stance,  a  pattern  developed  to  handle  the  sentence 

(9)  John  Smith  and  his  associate,  Roger  Jones,  the 
former  of  which  will  soon  be  on  board  at  IBM 
and  the  latter  of  which  will  be  heading  to  Ap¬ 
ple,  are  in  line  to  be  CEO  and  chairman,  re¬ 
spectively. 

is  unlikely  to  match  other  examples  in  any 
reasonably-sized  corpus  of  unseen  data.  The  big 
question, then, is at what point in development do
the great majority of examples fall into the second
class;  at  this  point  performance  gains  on  training 
data  do  not  transfer  to  gains  on  test  data.  It  could 
very  well  be  the  case  that  after  developing  patterns 
to  handle  the  examples  in  a  moderately-sized  train¬ 
ing  set  -  say  100  messages,  as  in  the  MUC-6  training 
corpora - one has reached the point of diminishing
returns. 

In  support  of  a  project  related  to  TIPSTER,  the 
Office  of  Research  and  Development  provided  us 
with  an  additional  set  (90  messages)  of  data  with 
keys  annotated  in  accordance  with  the  MUC-6  task 
specification.  This  gave  us  an  opportunity  to  see 
whether  new  improvements  inspired  by  this  data 
would  transfer  to  the  test  data.  The  changes  we  im¬ 
plemented  were  all  relatively  minor.  They  included: 

•  Fixing  a  few  problems  in  name  recognition 

•  Adding  a  parser  phase  pattern 

•  Adding  domain  phase  patterns  for  a  few 
metaphorical  expressions 

•  Eliminating  a  filter  for  irrelevant  texts 

•  Fixing  other  minor  bugs 

These  modifications  caused  our  score  on  the  new 
training  data  to  increase  from  46.4  to  52.1,  which 
is  not  a  surprising  result.  Given  that  the  fixes  were 
directed  narrowly  at  specific  examples  in  this  set, 
we  did  not  expect  to  see  much  of  an  improvement 
in  either  of  the  other  data  sets.  Our  suspicions  were 
confirmed  by  results  on  the  basic  training  data;  our 
score  on  this  set  went  from  58.6  to  59.6.  Quite  sur¬ 
prisingly,  however,  our  score  on  the  blind  test  set 
rose  significantly,  from  51.7  to  57.1  -  an  increase  of 
over  10%. 
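
For comparison across the three data sets, the gains can be expressed both in F-score points and as relative increases. The sketch below uses only the scores quoted above; the labels are ours.

```python
scores = {
    "new training data": (46.4, 52.1),
    "basic training data": (58.6, 59.6),
    "blind test set": (51.7, 57.1),
}

def relative_gain(before, after):
    """Increase as a fraction of the starting score."""
    return (after - before) / before

for name, (before, after) in scores.items():
    print(f"{name}: +{after - before:.1f} points ({relative_gain(before, after):.1%})")
```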

Thus,  necessarily  adding  a  proviso  about  the  ade¬ 
quacy  of  the  evaluation  metrics  per  the  last  section, 
we  have  a  negative  data  point  for  the  hypothesis  that 
100  training  messages  place  us  beyond  the  point  of 
diminishing  returns.  The  second  set  of  messages  ap¬ 
parently  had  considerable  overlap  with  the  test  data 
in  areas  that  did  not  overlap  with  the  original  train¬ 
ing  set. 

4  Focus  on  Portability 

A  second  major  obstacle  to  the  broad  utilization  of 
IE  technology  is  the  time  and  expertise  needed  to 
develop  new  systems.  Users  need  to  be  able  to  de¬ 
velop  extraction  systems  for  new  information  needs 
rapidly  and  without  the  assistance  of  a  system  de¬ 
veloper.  We  have  been  developing  infrastructure, 
consisting  of  patterns,  ontologies,  and  tools,  which 
brings  us  closer  to  these  capabilities. 

4.1 Open Domain System

The  majority  of  previously  pursued  IE  tasks,  in¬ 
cluding  those  in  the  MUC  evaluations,  have  been 
centered  on  extracting  information  from  a  narrowly 
defined  domain.  Alternatively,  one  might  imagine 
developing  a  system  capable  of  extracting  informa¬ 
tion  about  a  significantly  broader  set  of  events  that 
might  potentially  be  of  interest  to  an  analyst.  We 
call  such  a  system  an  open  domain  application. 

We  are  currently  completing  our  implementation 
of  an  open  domain  system  for  business  news.  The 
system  is  built  upon  an  infrastructure  consisting  of  a 
broad  set  of  patterns  and  ontologies.  These  patterns 
and  ontologies  will  serve  as  a  basis  for  the  analyst 
to  produce  special-purpose  IE  systems  (which  we 
call  Fastlets)  for  specific  information  needs.  Such 
Fastlets  could  be  used  not  only  for  database  gen¬ 
eration,  but  also  to  improve  systems  for  document 
and  subdocument  retrieval  and  for  task-driven  sum¬ 
marization,  among  other  applications. 

The  patterns  and  ontologies  were  developed  from 
an  in-depth  analysis  of  the  150  most  common  verbs 
and  nominalizations  within  a  corpus  of  Wall  Street 
Journal  texts.  A  frequency  analysis  was  performed 
to  identify  these  verbs  and  nominalizations,  and  a 
list  was  generated  of  all  the  sentences  in  the  corpus 
containing  each.  A  chart  was  then  constructed  for 
each  group,  listing  each  verb  and  its  role  fillers  (sub¬ 
ject,  object,  prepositional  objects).  This  gave  rise  to 
the  patterns  required  to  cover  the  examples,  and  the 
elements  and  organization  of  an  ontology  emerged. 
A  few  example  patterns  are  shown  below. 

Person analyzes { Industry | Commodity | Financial-Instrument }

{ Company | Person } controls Company

{ Company | Country } exports Goods to Country

Coperorg invests Money in { Financial-Instrument | Market | Country | Company }

The  italicized  elements  indicate  concepts  in  the  de¬ 
veloped  ontologies;  for  instance,  Coperorg  is  a  cat¬ 
egory  subsuming  several  other  concepts  including 
Person,  Company,  and  Organization. 
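
One way to picture how such patterns interact with the ontology is the sketch below. The data structures are illustrative assumptions and not the Fastus pattern language: each pattern lists the concepts allowed in subject position, a verb, and the allowed concepts for each argument, and a concept fills a slot if the slot's concept subsumes it.

```python
# Child concept -> parent concept (only the links named in the text).
ONTOLOGY = {
    "Person": "Coperorg",
    "Company": "Coperorg",
    "Organization": "Coperorg",
}

def is_a(concept, ancestor):
    """True if `ancestor` equals or subsumes `concept` in the ontology."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY.get(concept)
    return False

# (subject alternatives, verb, [(preposition or None, argument alternatives), ...])
PATTERNS = [
    (["Person"], "analyzes",
     [(None, ["Industry", "Commodity", "Financial-Instrument"])]),
    (["Company", "Person"], "controls", [(None, ["Company"])]),
    (["Company", "Country"], "exports", [(None, ["Goods"]), ("to", ["Country"])]),
    (["Coperorg"], "invests",
     [(None, ["Money"]),
      ("in", ["Financial-Instrument", "Market", "Country", "Company"])]),
]

def fits(concept, alternatives):
    return any(is_a(concept, alt) for alt in alternatives)

def matches(pattern, subject, verb, args):
    """Check a (subject concept, verb, [(prep, concept), ...]) clause against a pattern."""
    subj_alts, pat_verb, slots = pattern
    if verb != pat_verb or len(args) != len(slots):
        return False
    if not fits(subject, subj_alts):
        return False
    return all(
        prep == slot_prep and fits(concept, alts)
        for (prep, concept), (slot_prep, alts) in zip(args, slots)
    )
```

For example, a clause whose subject is a Company satisfies the "invests" pattern because Coperorg subsumes Company, whereas the same subject does not satisfy the "analyzes" pattern.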

Open  domain  patterns  are  integrated  with  the 
compile-time  transformation  component  of  Fastus. 
This  component  is  capable  of  taking  a  single  pattern 
and  specifying  the  different  ways  in  which  it  can  be 
expressed in English. Thus, the first pattern in the
list  above  will  not  only  match  sentence  (10), 

(10)  John  Smith  analyzed  the  automobile  industry. 


but it will also match examples such as (11) and (12).

(11)  The  automobile  industry  has  been  analyzed  by 
John  Smith. 

(12)  John  Smith’s  analysis  of  the  automobile  indus¬ 
try... 

The  output  of  the  open  domain  pattern  set  is  a 
case-frame  style  template,  marking  roles  and  modi¬ 
fiers  such  as  agent,  patient,  location,  time,  and  pur¬ 
pose. 
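
The idea that one pattern covers several surface forms, all yielding the same case frame, can be illustrated with a toy sketch. It works on plain strings with hand-supplied verb forms, which is a simplification we assume for the example; the actual transformation component operates on patterns at compile time.

```python
def surface_variants(agent, patient, forms):
    """Return (variant name, surface string) pairs for one agent-verb-patient pattern."""
    return [
        ("active", f"{agent} {forms['past']} {patient}"),
        ("passive", f"{patient} has been {forms['participle']} by {agent}"),
        ("nominal", f"{agent}'s {forms['nominal']} of {patient}"),
    ]

def case_frame(predicate, agent, patient):
    """The role-labeled template shared by every surface variant."""
    return {"predicate": predicate, "agent": agent, "patient": patient}

# Hand-written forms for one verb (hypothetical lexicon entry).
ANALYZE = {"past": "analyzed", "participle": "analyzed", "nominal": "analysis"}

variants = surface_variants("John Smith", "the automobile industry", ANALYZE)
frame = case_frame("analyze", "John Smith", "the automobile industry")
```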

Open Domain and Rule Acquisition. As we
have indicated, one of the ways in which the open
domain  infrastructure  can  be  used  is  as  the  basis 
for  allowing  end  users  to  construct  their  own  pat¬ 
terns  tailored  to  their  own  information  needs.  The 
development  process  will  be  much  like  what  expert 
developers  do  to  build  systems,  except  that  there 
will  be  a  richer  set  of  tools  for  doing  so.  For  in¬ 
stance,  in  our  MUC-6  effort,  we  first  outlined  the 
events  of  interest,  and  then  scanned  training  texts 
to  determine  the  verbs  and  nominalizations  that  en¬ 
coded  those  events.  We  then  categorized  them  into 
classes  of  verbs  with  the  same  case  frames,  and  wrote 
subject-verb-object  patterns  for  each  of  the  classes. 

We  are  currently  developing  an  interface  that  will 
allow  end  users  to  accomplish  this.  Analysts  will  se¬ 
lect  the  open  domain  patterns  that  are  relevant  to 
their  needs,  and  constrain  their  arguments  in  appro¬ 
priate  fashions.  The  system  will  support  testing  on 
existing  corpora  and  provide  assistance  for  further 
rule  adaptation.  The  interface  is  being  implemented 
in  Java. 

4.2  An  Application:  Using  IE  to  Improve 
Document  Retrieval 

As  we  have  mentioned  previously,  one  of  the  possi¬ 
ble  uses  for  Fastlets  is  to  improve  the  quality  of 
document  retrieval  (DR)  results.  We  discuss  some 
of  our  past  and  current  work,  as  well  as  future  plans. 

Completed Experiments. In work predating
TIPSTER  Phase  III,  a  topic  was  chosen  from  the 
TREC-5  corpus  which  overlapped  significantly  with 
the  MUC-6  management  succession  topic.  SRI’s 
MUC-6  system  was  used  to  reorder  the  retrieval  re¬ 
sults  from  the  UMASS  Inquery  ad-hoc  query  system, 
based  on  the  results  of  finite  state  pattern  matching. 
This  experiment  produced  a  positive  result,  which, 
while  far  from  being  definitive,  suggested  that  fur¬ 
ther  investigation  should  be  performed.  Of  course, 
the  scenario  that  was  being  tested  is  not  realistic,  as 
such  highly  developed  IE  systems  will  not  generally 
exist for most information needs. A more reasonable
scenario  would  be  one  in  which  a  rapidly  developed 
Fastlet  is  used  to  perform  such  a  task. 

During  TIPSTER  Phase  III,  SRI  teamed  with 
GE  R&D  to  participate  in  the  TREC-6  evaluation. 
Fastlets  were  developed  for  23  of  the  47  topics  in 
the  TREC-6  routing  task,  in  which  up  to  4  hours 
per  topic  was  spent  reading  a  small  set  of  relevant 
texts  and  writing  a  small  number  of  grammar  rules 
and  lexical  attributes.  Each  Fastlet  was  then  run 
overnight  on  some  additional  training  data,  and  an¬ 
other  1  to  2  hours  was  spent  (on  average)  making  any 
necessary adjustments, for an average total of 4
to  6  hours  per  topic.  The  majority  of  these  Fastlet 
grammars  were  developed  by  a  Stanford  undergrad¬ 
uate,  who  has  the  characteristics  one  might  expect 
the  end  user  of  such  a  system  to  have:  he  is  smart 
and  computer  literate,  but  knows  essentially  nothing 
about NLP, linguistics, IE, or DR.

In  the  GE/SRI  joint  TREC-6  entry,  the  routing 
query  version  of  GE’s  DR  system  was  used  to  pro¬ 
vide  the  top  2000  ranked  documents  for  each  topic. 
The  Fastlets  were  then  used  to  rerank  the  list  and 
produce the top 1000. The results were encouraging,
albeit again not definitive. In abstract terms,
the GE/SRI system improved on the results of the
GE system alone for 16 topics, degraded them for
5 topics, and received the same results on 2 topics.
Of the 16 topics in which the results were improved,
in 2 cases the improvement was very significant, in
6 cases the improvement was significant, and in 8
cases the improvement was small and insignificant.
For one topic, ours was the best performing system.
Of the 5 cases in which the results were degraded,
in 3 cases the decline was significant and in 2 cases
it was very significant.
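
The reranking step can be pictured with the sketch below. How the Fastlet evidence was actually scored is not specified here, so ordering by number of pattern matches, with ties broken by the original rank, is purely an assumption for illustration.

```python
def rerank(ranked_doc_ids, fastlet_hits, keep=1000):
    """Reorder a ranked list by Fastlet match count, keeping the top `keep`.

    ranked_doc_ids: document ids in the order produced by the DR system
    fastlet_hits:   document id -> number of pattern matches (missing means 0)
    """
    order = sorted(
        range(len(ranked_doc_ids)),
        key=lambda i: (-fastlet_hits.get(ranked_doc_ids[i], 0), i),
    )
    return [ranked_doc_ids[i] for i in order[:keep]]

# Tiny example with hypothetical document ids.
top = rerank(["d1", "d2", "d3", "d4"], {"d3": 2, "d4": 2, "d2": 1}, keep=3)
```

Note that the only DR evidence this uses is the position of each document in the input list.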

These  results  are  encouraging  in  that  they  indicate 
that  the  Fastlet  approach  to  improving  DR  may 
be feasible, considering that in at least some cases
NLP  techniques  improved  the  results  of  an  already 
competitive  routing  query  system. 

An Ongoing Study. The results of the foregoing
experiments are especially encouraging considering
that they were achieved using a highly suboptimal
overall architecture. The DR and IE systems
were  treated  as  black  boxes:  the  DR  system  ranked 
documents  using  standard  DR  types  of  evidence 
(word  frequency  analysis),  and  then  the  Fastlets 
reranked  the  documents  based  on  pattern  matching 
evidence,  without  considering  (or  even  having  access 
to)  the  DR  evidence.  All  the  Fastlets  had  access 
to  was  the  output  ordering.  In  actuality,  it  is  likely 
that both types of evidence are useful for relevance
determination, and that the relative usefulness of
each  varies  on  a  per-topic  basis.  What  is  needed  is 
an  architecture  in  which  the  DR  and  IE  evidence 
is  considered  together,  with  a  principled  mechanism 
for  selecting  the  most  informative  features  for  docu¬ 
ment  relevance  on  a  per-topic  basis. 

We  are  currently  pursuing  such  an  architecture, 
which, in addition to certain modifications to Fastus,
requires a research-level DR capability. We
have  implemented  a  variety  of  word  collection  and 
frequency  analysis  mechanisms  which  leverage  the 
considerable  tokenization  and  morphological  anal¬ 
ysis  capabilities  of  Fastus.  We  have  also  imple¬ 
mented  several  learning  algorithms  capable  of  incor¬ 
porating  and  weighing  heterogeneous  types  of  evi¬ 
dence. 

In  order  to  speed  up  Fastus  processing  on  large 
data  collections,  we  implemented  a  “trigger  word” 
compiler  for  Fastus  grammars.  The  mechanism 
reads  in  a  pattern  set  and  generates  a  list  of  words 
required  to  match  them.  Any  sentence  that  does 
not  contain  a  word  on  the  list  can  be  ignored  after 
early  stages  of  processing.^  Initial  experiments  have 
indicated  a  speed  up  of  more  than  a  factor  of  three. 
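
The trigger-word idea can be sketched as follows. The representation of a pattern set as lists of literal words, and the simple tokenizer, are assumptions made for the example; the actual compiler derives the word list from Fastus grammars.

```python
import re

def compile_triggers(pattern_words):
    """Collect every literal word that some pattern needs in order to match.

    pattern_words: pattern name -> list of literal words (hypothetical format)
    """
    return {w.lower() for words in pattern_words.values() for w in words}

def relevant_sentences(sentences, triggers):
    """Yield only the sentences containing at least one trigger word."""
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if tokens & triggers:
            yield sentence

triggers = compile_triggers({"succession": ["named", "appointed", "replaces"]})
kept = list(relevant_sentences(
    [
        "John Smith, 47, was named president of ABC Corp.",
        "He replaces Mike Jones.",
        "The weather was mild.",
    ],
    triggers,
))
```

Sentences that are filtered out skip the later, more expensive stages, which is where the speedup would come from.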

Mechanisms  for  generating  relevance  features 
from  Fastlet  results  are  currently  being  imple¬ 
mented,  in  preparation  for  the  learning  experiments. 
We  will  report  on  the  results  of  these  experiments  in 
a  future  forum. 

5  Conclusions  and  Future  Directions 

We  have  summarized  SRI’s  developments  in  address¬ 
ing  two  major  obstacles  to  the  broad  deployment  of 
IE  technology:  accuracy  and  portability.  The  TIP¬ 
STER  program  has  witnessed  significant  progress  in 
both  areas,  and  has  perhaps  witnessed  even  greater 
progress  in  our  understanding  of  IE  technology. 

We believe that the current state of IE technology
suggests two main directions for future work,
which point toward opposite ends of the
research-to-applications spectrum. The first direction
is to leverage the progress we have made to
embed  IE  technology  within  applications  in  which 
it  can  be  useful.  Candidate  applications  include 
document retrieval, task-based summarization, task-based
machine translation, cross-document and multimedia
fusion, and trend analysis. Current progress
prepares us well for such investigations, the critical
question being whether current levels of accuracy are
sufficient for success.

^Although it should be noted that every sentence
needs to be processed up through the combiner phase
if coreference is to work optimally, since referents for
referential expressions can occur in otherwise irrelevant
sentences. The degree to which ignoring this fact affects
performance is an empirical question, which will be
studied in future work.

Our  work  has  also  suggested  that  if  we  are  to 
achieve  revolutionary  (rather  than  merely  evolution¬ 
ary)  improvements  in  the  state-of-the-art,  we  also 
need  to  step  back  and  focus  on  fundamental  re¬ 
search.  Current  approaches  are  good  at  identifying 
the  information  that  natural  language  “wears  on  its 
sleeve”;  the  remainder  will  require  new  and  richer 
techniques.  Basic  research  is  necessary  to  guide  the 
development  of  such  mechanisms,  and  must  be  cou¬ 
pled  with  an  investigation  into  evaluation  mecha¬ 
nisms. 